This document is a brief introduction to bioinformatic tools and concepts used for sequencing analysis using UNSW’s computational Cluster, Katana. The focuses is 16S rRNA analysis using QIIME2 and our MRC 16S rRNA pipeline. It will also introduce you to the phyloseq package in R for custom graphics in R.
You can find a nice introduction to HPC here. Below are some basic introductions.
What is a High Performance Cluster (HPC)
A High Performance Cluster refers to a cluster of computers used to tackle large scale analytical problems. Essentially it’s a large collection of computing resources which can be utilised for scientific research.
Compute Nodes
Compute nodes can be thought of as powerful computers/servers, which carry out the majority of the work on a HPC. Each computer node has it’s own, Operating System (OS), Central Processing Unit (CPU), Memory (RAM) and Hard Drive (HD).
Head Node
The Head node is essentially the brains behind the HPC. It is responsible for tracking and allocating computing resources to users.
Login Node
This is the node which controls user access and logon
Scheduling System
At the heart of the HPC is the software which manages the workload. Essentially it’s a program that attempts to balance utilisation across the resource available.
Queue
All jobs run on queues. Queues have differing attributes which match the jobs which run on them. These attributes can relate to things like: job run time length, number of slots, amount of memory.
Job
A job is essentially your task, the code which you ask the HPC to execute.
Katana is a shared computational cluster located on campus at UNSW that has been designed to provide easy access to computational resources for groups working with non-sensitive data. Katana will be used for your next generation sequencing analysis.
Download a terminal in order to log on to a remote terminal e.g. Iterm (Mac) or mobaxterm (Windows)
ssh zID@katana.unsw.edu.auLog on from Mobaxterm (Windows Computer)
Always create an interactive job to work on Katana - Do not run jobs on head node
qsub -I -l nodes=1:ppn=1,mem=10gb,walltime=10:00:00Always work in your scratch drive, this directory has a lot more space than your home
cd /srv/scratch/zIDA lot of programs are pre-installed on Katana, to view:
module availTo load a program
module load programUse Katana data mover (kdm) to move data. Data can be moved from desktop to remote server using a SFTP client WinSCP or FileZilla for mac. Use this to upload data (e.g. study metadata) and download your results.
ssh zID@kdm.unsw.edu.auYou can also log on to the kdm on your ssh client in the same way you log on to katana.
Use this if you are downloading data or databases to Katana
ssh zID@kdm.unsw.edu.auUNSW research technology offer a range of training courses, that as of 2020 have been moved online. These are a great introduction to general programming and will help you improve the efficiency of your work. Explore them here
Introduce yourself to basic linux command using an online course
| Command | Meaning | Example |
|---|---|---|
ls |
list files in current directory, with -l, it also displays file permissions, sizes and last updated date/time | ls -l, ls |
pwd |
displays your current location in the file system | pwd |
env |
displays your user environment settings (e.g. search path, history size and home directory) | env |
cd |
change directory | cd /file/path |
mv |
move file, change file name - Careful with this, no undo! | mv /path/file/ /newpath/file ,
mv oldname newname |
cp |
copy files, -R copy directory | cp /path/file/ /newpath/file,
cp -R ./directory ./newdirectorylocation |
head |
print top lines in file | head filename , head -10 filename |
tail |
print last lines in file | tail filename , tail -10 filename |
less |
view lines in file, use spacebar to go down file | less filename |
grep |
search file for pattern or string | cat filename |grep "aaatttcc" |
* |
wildcard, use with other commands | ls *.fastq : list all files with suffix “.fastq” |
wc |
word count, with -l = linecount | wc filename, wc -l filename |
rm |
remove filename , with -R directory Again careful | rm filenmae |
exit |
exit remote server | exit |
. |
This represents the current directory | pwd .,
cd ./path/to/subdirectory/from/currentdir |
.. |
This represents the parent directory | pwd ..,
cd ../path/to/subdirectory/from/parentdir |
When learning bioinformatics, you will most likely need to create or edit text files, shell scripts or Python scripts from the command line. Using a Unix-based text editor good practice for getting used to the environment if you are new to the command line.
There are multiple text editors available. Which one you use is based on user preference, beginners usually go for nano or vim.
Text Editors
User Guides
Useful Tip
If using vim and you are stuck in a wrong mode etc, you
can always escape using by pressing ESC followed by
:q! and this will allow you exit without saving any
changes.
Alpha diversity refers to the average species diversity in a habitat or specific area (i.e. human subject)
Taxonomic data is rarefied to an even sequencing depth prior to alpha diversity analysis. This is because samples with many more reads may look like they have a higher number of different bacteria present than samples with a low sequencing depth. We can illustrate this with a rarefaction curve.
Importance: Diversity indices provide important
information about rarity and commonness of species in a community. The
ability to quantify diversity in this way is an important tool for
biologists trying to understand community structure.
Beta diversity describes how different the microbial composition in one environment compared to another
Based on abundance or read count data
Differences in microbial abundances between two samples (e.g., at
species level) values are from 0 to 1
Based on presence or absence of species
Does
not include abundance information. Difference in microbial composition
between two samples
Based on the fraction of branch length that is shared between two samples or unique to one or the other sample
Unweighted UniFrac
Weighted UniFrac
Permutational multivariate analysis of variance (PERMANOVA) is used in the report to quantify multivariate community-level differences between groups.
PERMANOVA
is a non-parametric multivariate statistical test. It is used to compare
groups of objects and test the null hypothesis that the centroids and
dispersion of the groups as defined by measure space are equivalent for
all groups.
Dissimilarity scores can be plotted in an ordination plot to visualise sample/group similarity. Below is the example output from QIIME, you can also produce your own custom ordination graphics using R studio
In biological classification, taxonomic rank is the relative level of a group of organisms (a taxon) in a taxonomic hierarchy. Examples of taxonomic ranks are species, genus, family, order, class, phylum, kingdom, domain, etc.
QIIME2 output will output a classification at each level. You will use the taxa tables to you an insight into the specific taxa which are differentially abundant between treatment groups.
Create bin directory in your scratch drive
mkdir binMove into bin
cd /srv/scratch/zID/binDownload miniconda
curl https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -o Miniconda3-latest-Linux-x86_64.shIn your terminal window, run
bash Miniconda3-latest-Linux-x86_64.shFollow the prompts on the installer screens. To make the changes take effect, close and then re-open your terminal window.
Test your installation. In your terminal window, run the command
conda list. A list of installed packages appears if it has
been installed correctly.
Installation guide here and summary below.
Download QIIME 2 Core 2020.2 distribution
wget https://data.qiime2.org/distro/core/qiime2-2020.2-py36-linux-conda.ymlCreate a conda environment and install the QIIME 2 Core 2020.2 distribution within the environment
conda env create -n qiime2-2020.2 --file qiime2-2020.2-py36-linux-conda.ymlNow that you have a QIIME 2 environment, activate it using the environment’s name
conda activate qiime2-2020.2To deactivate an environment
conda deactivateYou can test your installation by activating your QIIME 2 environment and running:
qiime --helpIf no errors are reported when running this command, the installation was successful!
When submitting samples for sequencing, begin sample with a character
or string followed by a numeric identifier e.g. S1, or
sample_1.
Do not submit samples with names beginning with
0, - etc… this causes problems downstream.
Metadata is most commonly stored in a TSV (i.e. tab-separated values) file. These files typically have a .tsv or .txt file extension. TSV files are simple text files used to store tabular data, and the format is supported by many types of software. You can create and edit your .tsv file in excel!
Read more about metadata QIIME2 here
QIIME 2 will also automatically validate a metadata file anytime it is used by the software. You can use Keemei to validate your metadata prior to importation into QIIME2.
In Qiime2 any cell in the metadata contains leading or trailing
whitespace characters (e.g. spaces, tabs), those characters will be
ignored when the file is loaded. Thus, leading and trailing whitespace
characters are not significant, so cells containing the values
"gut" and " gut " are equivalent.
This is NOT the case in R with "gut"
and " gut " recognised as different values, so it is best
to avoid this practice in general.
Text files created on Windows machines can have different line endings than files created on Unix/Linux. This can cause problems when analysing data. It is good to be aware of this and check this out if you run into problems during your analysis.
To avoid this after preparing the metadata file, use Noteplus++. Open the metadata file and check the end is linux format by showing symbols, if the endline is LF, correct, if it is CRLF, replace CR with blank.
Line endings in general
\r\n\r\n\nThe following format will work in both QIIME2 and R. Using
#Sample or #SampleID for the first column of
metadata will work in QIIME2 but not R.
| ID | Sample | Subject | Site | Timepoint | Diagnosis | Diagnosis_Timepoint |
|---|---|---|---|---|---|---|
| example1_S89_L001 | example1 | Subject_1 | gut | Day_0 | Cancer | Cancer_Day_0 |
| example3_S42_L001 | example3 | Subject_2 | gut | Day_0 | Cancer | Cancer_Day_0 |
| example5_S51_L001 | example5 | Subject_3 | gut | Day_0 | Healthy | Healthy_Day_0 |
| example7_S7_L001 | example7 | Subject_4 | gut | Day_0 | Healthy | Healthy_Day_0 |
| example6_S4_L001 | example6 | Subject_1 | gut | Day_4 | Cancer | Cancer_Day_4 |
| example8_S78_L001 | example8 | Subject_2 | gut | Day_4 | Cancer | Cancer_Day_4 |
| example12_S80_L001 | example12 | Subject_3 | gut | Day_4 | Healthy | Healthy_Day_4 |
| example14_S79_L001 | example14 | Subject_4 | gut | Day_4 | Healthy | Healthy_Day_4 |
| example2_S9_L001 | example2 | Subject_1 | gut | Day_5 | Cancer | Cancer_Day_5 |
| example4_S57_L001 | example4 | Subject_2 | gut | Day_5 | Cancer | Cancer_Day_5 |
| example10_S25_L001 | example10 | Subject_3 | gut | Day_5 | Healthy | Healthy_Day_5 |
| example17_S22_L001 | example17 | Subject_4 | gut | Day_5 | Healthy | Healthy_Day_5 |
Prior to this please ensure that you have completed the moving picture tutorial for QIIME2. It is important to understand the command line steps independently to the MRC pipeline in order to help you debug problems that may arise with code.
Prior to importing data into QIIME, the data will be assessed with Fastp. This step functions to assess overall sequence quality. Output information will be in a html format and will include:
| Phred Quality | Score Probability of incorrect bases call | Base call accuracy |
|---|---|---|
| 10 | 1 in 10 | 90% |
| 20 | 1 in 100 | 99% |
| 30 | 1 in 1000 | 99.9% |
For targeted filtering we use DADA2 within QIIME2. This is a relatively new method and pushed the field from the practice of binning sequences into 97% OTUs, to effectively using the sequences themselves as the unique identifiers for a taxon. Now referred to 100% Operational Taxonomic Unit (OTU)or amplicon sequence variants(AVS).
Below is the summary statistics from the targeted filtering we use DADA2.
As you would have learned in the moving pictures tutorial, to visualize the denoising stats from running dada2 denoise-paired run the following command and view the file in QIIME2 View. To download as a TSV to run statistical analysis, click the button in the left corner
qiime metadata tabulate \
--m-input-file stats-dada2.qza \
--o-visualization stats-dada2.qzv
Bowtie2 is used for the removal of host contamination. If you are working with samples other than stool then it is important to remove any host contamination; i.e. murine or human. All non-chimeric merged sequences were aligned against the host genome. Successfully aligned sequences were then removed prior to subsequent analyses.
Make directory for project - important to organise your directory and data in a logical fashion!
mkdir /srv/scratch/zID/project_directoryWithin this make directory for raw data
mkdir /srv/scratch/zID/project_directory/fastqConcatenate the fastq files for QC and importation into Qiime2
cat *R1_001.fastq > all_R1.fqcat *R2_001.fastq > all_R2.fqMove concatenated files into directory away from original fastq files
Make directory for concatenated files
mkdir /srv/scratch/zID/project_directory/cat_filesMove the concatonated file into cat_files .. example of
where you use * (wildcard)
mv all* /srv/scratch/zID/project_directory/cat_filesCreate output directory for quality control step of pipeline
mkdir /srv/scratch/zID/project_directory/qc_outThe MRC 16S rRNA pipeline (MIMA) is written in the coding language
perl and stored in /srv/scratch/mrcbio/mima/. To gain
access to this directory contact the MRC bioinformatics team.
perl /srv/scratch/mrcbio/mima/prepare_manifest_and_qc.pl <Absolute path> <output.manifest> <total_1.fq> <total_2.fq> <outputdir> <# of threads>| Option | Meaning |
|---|---|
| Absolute path | Directory to your raw fastq
files:/srv/scratch/zID/project_directory/fastq |
| Output Manifest | output_manifest |
| Total_1.fq | /srv/scratch/zID/project_directory/cat_files/all_R1.fq |
| Total_2.fq | /srv/scratch/zID/project_directory/cat_files/all_R2.fq |
| outputdir | /srv/scratch/zID/project_directory/qc_out |
| # of threads | Threads are a way for a program to divide (termed “split”) itself into two or more simultaneously running tasks, default 12 |
Example of code
perl /srv/scratch/mrcbio/mima/prepare_manifest_and_qc.pl /srv/scratch/zID/project_directory/fastq output_manifest /srv/scratch/zID/project_directory/cat_files/all_R1.fq /srv/scratch/zID/project_directory/cat_files/all_R2.fq /srv/scratch/zID/project_directory/qc_out 12Amend job file so that it works with version on qiime2 installed in your account
To do this use a text editor vim
vim fastp.pbsTo edit using vim hit i, this will allow you to insert
text, add the following text to the beginning of the fastp.pbs file
#!/bin/bash
#PBS -l nodes=1:ppn=12
#PBS -l mem=80gb
#PBS -j oe
#PBS -l walltime=100:00:00
#export LC_ALL=en_AU.utf8
#export LANG=en_AU.utf8
conda activate qiime2-2020.2To say the changes type :wq , this will save and quit,
if you made a mistake and would like to quit without saving type
:q!
Once your changes have been made submit your job :
qsub fastp.pbs
To view it’s progress qstat | grep zID
Make output directory for host remove output
mkdir /srv/scratch/zID/project_directory/remove_hostPipeline command for remove host comtination
perl /srv/scratch/mrcbio/mima/decontaminate_host.pl <rep_seqs.qza> <table.qza> <outputdir> <qiime_version_qiime2-2020.2> <host_genome.fa_bowtie2_index>| Option | Meaning |
|---|---|
| rep_seqs.qza | ./project_directory/qc_out/rep_seqs.qza |
| table.qza | ./project_directory/qc_out/table.qza |
| outputdir | ./project_directory/remove_host |
| qiime_version_qiime2-2020.2 | qiime2-2020.2 |
| host_genome.fa_bowtie2_index | Human genome: /srv/scratch/mrcbio/humangenome/, Murine
genome: /srv/scratch/mrcbio/mousegenome/ |
In your interactive job either run or set up job similarly to above
bash decontaminationhost.pbsThis plot shows that as the sequencing depth (Sample Size) gets higher, more species are observed. One way to overcome this is to normalise by subsampling to even sequencing depths. This is like putting all of your counts for a sample in a bag and randomly pulling out a set number of counts so that all samples end up having the same total (refer to the “rarefying”). This approach is widely criticised because it can lead to throwing out huge amounts of data, but it remains one of the best (and most simple) ways to normalise before estimating alpha-diversity since some metrics rely on actual counts (integers).
Use sequencing depth for -d option in
diversity-basic-statistic.pl to indicate the sampling depth
you want your sequences rarefied too.
table-summarize.qzv and visualise the data in Qiime view.Make directory for output of basic diversity analysis
mkdir /srv/scratch/zID/project_directory/basic_diversityMRC pipeline command for basic diversity analysis. This script will produce the code to analysis the basic diversity metrics for all metadata factors in your metadata
perl /srv/scratch/mrcbio/mima/diversity-basic-statistic.pl -m sample.metatdata.tsv -c clasifier.qza -o output_dir -n [threads number] -h [verbose]| Option | Meaning |
|---|---|
| -m sample.metatdata.tsv | ./project_directory/sample_metadata.tsv |
| -c clasifier.qza | ./bin/db/clasifier.qza |
| -o output_dir | ./project_directory/basic_diversity |
| -n [threads number] | Threads are a way for a program to divide (termed “split”) itself into two or more simultaneously running tasks, default 12 |
| -d depth | depth of default [5000] - sequencing depth |
To find out the sequencing depth of samples run the following in QIIME2 use FileZilla to download table-summarize.qzv output from host DNA with view with QIIME2 view
For this make sure you have access to a classifier, you may need to download a 16S rRNA database and train your own classifer. Commonly used databases include Greengenes and Silvia (Silva more up to date!)
Read the documentation and give it a go!
The most commonly used 16S rRNA primers used in the MRC are the 16S_V3-V4 (341f-805r). For different primer types the databases will need to be re-trained.
341F: CCTACGGGNGGCWGCAG
805R: GACTACHVGGGTATCTAATCC
As mentioned above you will use the taxa tables to you an insight into the specific taxa which are differentially abundant between treatment groups.
You can visualise relative taxonomic composition of your samples and
your download the normalised taxonomy table using
normalized.taxa-bar-plots.qzv with QIIME2 view
This section is mainly to highlight ways to present your data if you are conducting a longitudinal study and to highlight some additional features you may not have come across in qiime2. The documentation on qiime2 is excellent and please use it to learn about these extra plugin features. There are also some tutorials incorporating a longitudinal dataset.
Example code to produce the .qzv volatility output
file
qiime longitudinal volatility
--m-metadata-file metadata.txt #metadata file (N.B. important to have this formatted logically - must match previous analysis to obtain core metric results)
--m-metadata-file ./core-metrics-results/shannon_vector.qza #alpha diversity output (obtained from earlier analysis)
--p-default-metric shannon #metric
--p-default-group-column treatment #treatment metadata i.e. gestational diabetes/healthy
--p-state-column Timepoint #timepoint metadata data e.g. trimester
--p-individual-id-column mouse_ID #subject
--o-visualization volatility_shannon.qzv #outputView, manipulate and download this .qzvfile in qiime2 view
Utilise UNSW introductory courses or online resources to R in order to familarise yourself with the fundamentals. Introductory course timetables can be found here
Download R studiohere
From your sequence analysis in QIIME2 download to your desktop
table.qzarooted-tree.qzataxonomy.qzasample-metadata.tsvInstall required packages
if (!requireNamespace("devtools", quietly = TRUE)){install.packages("devtools")}
devtools::install_github("jbisanz/qiime2R")The R package phyloseq is designed specifically for analysing microbiome data. For more information, see the homepage.
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("phyloseq")In R studio set working directory
setwd("/Desktop/filepath/")Create temporary directory
dir.create("tmpdir")Import data as phyloseq object. A phyloseq object holds all of the data necessary for the analysis in a single place. It can contain: metadata for the samples, the abundance table, taxonomic assignments, a phylogenetic tree, and reference sequences.
phy_obj<-qza_to_phyloseq("rarefy.table.qza", "rooted-tree.qza", "taxonomy.qza", "sample-metadata.tsv", tmp="tmpdir")Use phyloseq tutorial for publication quality graphics